if (!requireNamespace("dplyr", quietly = TRUE))
install.packages('dplyr')
if (!requireNamespace("tidyverse", quietly = TRUE))
install.packages('tidyverse')
if (!requireNamespace("Seurat", quietly = TRUE))
install.packages('Seurat')
if (!requireNamespace("colorBlindness", quietly = TRUE))
install.packages('colorBlindness')
if (!requireNamespace("RColorBrewer", quietly = TRUE))
install.packages('RColorBrewer')
if (!requireNamespace("cluster", quietly = TRUE))
install.packages('cluster')
if (!requireNamespace("viridis", quietly = TRUE))
install.packages('viridis')
if (!requireNamespace("ggplot2", quietly = TRUE))
install.packages('ggplot2')
if (!requireNamespace("ggalluvial", quietly = TRUE))
install.packages("ggalluvial")
if (!requireNamespace("tidygraph", quietly = TRUE))
install.packages("tidygraph")
if (!requireNamespace("ggraph", quietly = TRUE))
install.packages("ggraph")
6 - Clustering
Introduction
Clustering is the process of grouping cells based on their gene expression profiles and takes place after dimensionality reduction. For scRNA-seq data we use community detection algorithms: we compute a K-nearest-neighbor graph and then group cells based on their shared nearest neighbors. To obtain fine-grained annotations with as little information lost or misidentified as possible, clustering should be done iteratively.
In this notebook, we learn:
Clustering in Seurat requires the computation of a K-nearest-neighbor graph followed by the identification of clusters in this graph using the functions FindNeighbors() and FindClusters().
The k.param parameter in FindNeighbors() determines the connectivity of the graph. A higher k produces a more highly connected graph, which results in fewer clusters.
The resolution parameter in FindClusters() controls the granularity of the clustering. A higher resolution results in more clusters.
Clustering results can be evaluated quantitatively by looking at the sample and batch distribution across clusters, using the Simpson diversity index, and a silhouette analysis of the clusters.
Review
We left off last session filtering out the poor-quality data without regard for its distribution. Last week, we learned how principal components are calculated, what a latent space is, and defined a kNN (k-nearest neighbors) graph.
To review those:
We take a cell x gene matrix, normalize it, and reduce its dimensions using PCA. We looked at the Elbow plot to determine the optimal number of PCs to capture the variance. After selecting the top 30 PCs, we generated a k-nearest neighbors (kNN) graph to represent the relationships between cells based on their gene expression profiles and ran FindClusters() to identify clusters of cells based on their neighborhood relationships. Kyle introduced Harmony as an integration technique to correct for batch effects and I just went over QC metrics including doublet detection that help us understand the quality and content of our data.
Glossary
PCA - A dimensionality reduction technique that reduces the number of dimensions in a dataset while retaining as much variance as possible. The first principal component captures the axis with most variance in the data, each subsequent component captures the next independent/orthogonal axis of variability accounting for the most variance left.
Latent Space - The latent space is the low-dimensional representation of the data that captures the most important features.
kNN Graph - A graph that represents the relationships between cells based on their gene expression profiles. Each cell is connected to its k (e.g. 1, 20, 100) nearest neighbors in the graph.
SNN Graph - A graph that represents how many neighbors are shared between two cells. This makes the graph more robust to outliers.
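The SNN edge weight can be made concrete with a toy sketch: it is the Jaccard index of two cells' neighbor sets (the cell names below are made up for illustration).

```r
# Toy sketch: the SNN edge weight between two cells is the Jaccard index
# of their neighbor sets (cell names are hypothetical)
neighbors_a <- c("cell2", "cell3", "cell4", "cell5")
neighbors_b <- c("cell3", "cell4", "cell6", "cell7")

jaccard <- length(intersect(neighbors_a, neighbors_b)) /
  length(union(neighbors_a, neighbors_b))
jaccard  # 2 shared neighbors out of 6 distinct ones = 1/3
```

Cells sharing many neighbors get a strong edge; cells sharing few get a weak edge that can later be pruned.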
Resources
Load Libraries and Data
library(dplyr)
library(Seurat)
library(tidyverse)
library(RColorBrewer)
library(colorBlindness)
library(DoubletFinder)
library(cluster)
library(viridis)
library(ggplot2)
library(ggalluvial)
library(tidygraph)
library(ggraph)
set.seed(687)
Load Data
# Load the Seurat object with doublet and batch information
se <- readRDS('../data/Covid_Flu_Seurat_Object_Quality.rds')
se
An object of class Seurat
33234 features across 59572 samples within 1 assay
Active assay: RNA (33234 features, 3000 variable features)
3 layers present: counts, data, scale.data
2 dimensional reductions calculated: pca, umap
# Set color palette for cell types
pal <- paletteMartin
names(pal) <- sort(unique(se$Celltype))
donor_pal <- c(
"#66C2A4", "#41AE76", "#238B45", "#006D2C",
"#41B6C4", "#1D91C0", "#225EA8", "#253494", "#081D58",
"#FFEDA0", "#FED976", "#FEB24C", "#f79736", "#FD8D3C",
"#FC4E2A", "#E31A1C", "#BD0026", "#ad393b", "#800000", "#800050")
names(donor_pal) <- c(
"Normal 1", "Normal 2", "Normal 3", "Normal 4",
"Flu 1", "Flu 2", "Flu 3", "Flu 4", "Flu 5",
"nCoV 1", "nCoV 2", "nCoV 3", "nCoV 4", "nCoV 5",
"nCoV 6", "nCoV 7", "nCoV 8", "nCoV 9", "nCoV 10", "nCoV 11"
)
Set Up
PCA
Let’s first look at the data according to the top 2 principal components.
celltype_pca <- DimPlot(
se,
reduction = "pca",
group.by = 'Celltype',
cols = pal
)
celltype_pca
Construct the kNN graph with FindNeighbors():
FindNeighbors() is the Seurat function that calculates the k-nearest neighbors of each cell in PCA space. The number of neighbors used for this calculation is controlled by the k.param parameter, with a default value of 20. It computes pairwise distances between cells in the reduced space using distance metrics such as Euclidean distance, Manhattan distance, Pearson correlation distance, or cosine similarity.
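As a hand-rolled illustration (not Seurat's internal code), two of these metrics can be computed directly for a pair of made-up cells in a reduced space:

```r
# Two hypothetical cells in a 3-dimensional reduced space
a <- c(1, 2, 3)
b <- c(2, 4, 6)

# Euclidean distance: straight-line distance between the two points
euclidean <- sqrt(sum((a - b)^2))

# Cosine similarity: angle between the vectors, ignoring magnitude
cosine_sim <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

euclidean   # sqrt(14), about 3.74
cosine_sim  # 1: b is a scaled copy of a, so the angle between them is 0
```

Note how the two metrics disagree here: the points are far apart in Euclidean terms but maximally similar by cosine, which only compares direction.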
From BioStars:
FindNeighbors() is a function that is used to find the nearest neighbors of your single cell data point within a dataset. It works by calculating the neighborhood overlap (Jaccard index) between every cell and its k.param nearest neighbors. It’s often employed in various applications such as anomaly detection and dimensionality reduction. The concept is that given a data point, you want to identify the closest data points to it based on some similarity metric, such as Euclidean distance or cosine similarity. This helps to identify similar points in the dataset, which can be useful for making predictions or understanding the distribution of the data.
See ?FindNeighbors for its default values and more information. Let’s modify these to fit our analysis:
se <- se %>%
FindNeighbors(
reduction = "pca",
dims = 1:20,
k.param = 20,
verbose = FALSE
)
# Look at the k-nearest neighbors (nn) and shared nearest neighbors (snn) graphs computed
se@graphs$RNA_nn
A Graph object containing 59572 cells
$RNA_snn
A Graph object containing 59572 cells
FindClusters
FindClusters() identifies clusters of cells based on their neighborhood relationships. It takes as input the graph made by FindNeighbors() and outputs cluster assignments for each cell, found at se@meta.data$seurat_clusters.
The resolution parameter controls the granularity of the clustering. Higher values of resolution will result in more clusters, while lower values will result in fewer clusters. The default value is 0.8.
From BioStars:
FindClusters() is a function used for clustering data points into groups or clusters based on their similarity. It uses a graph-based clustering approach and a Louvain algorithm. Clustering is an unsupervised learning technique where the algorithm groups similar cells together without any predefined labels. The goal is to find patterns and structure in your data. The number of clusters and the algorithm used can vary based on the problem and data characteristics. Common clustering algorithms include K-means, hierarchical clustering, and DBSCAN.
where ‘algorithm’ =
1 - original Louvain algorithm
2 - Louvain algorithm with multilevel refinement
3 - SLM (Smart Local Moving) algorithm
4 - Leiden algorithm
See ?FindClusters for more information.
Clustering
Clustering is considered a classical unsupervised machine learning task. Seurat identifies cell clusters with community detection algorithms such as Louvain (the default) and Leiden. A community detection algorithm identifies groups of nodes in a network that are densely connected. These algorithms connect cells by finding tightly connected communities in the shared nearest neighbor graph computed in FindNeighbors(). Shared nearest neighbor graphs are more robust than k-nearest-neighbor graphs because they weight edges by the number of neighbors two cells share. Leiden is a 2019 improvement on the Louvain algorithm, so it is common to see both in scRNA-seq analysis.
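As a minimal sketch of community detection (using the igraph package directly, not Seurat's wrapper), Louvain recovers two densely connected groups in a toy graph. This is the same principle FindClusters() applies to the SNN graph, just on a tiny hand-built network:

```r
library(igraph)

# Toy sketch: two 5-node cliques joined by a single bridging edge.
# Louvain should recover the two cliques as separate communities.
g <- make_full_graph(5) + make_full_graph(5)  # disjoint union of two cliques
g <- add_edges(g, c(1, 6))                    # one edge bridging them

comm <- cluster_louvain(g)
length(comm)      # number of communities found: 2
membership(comm)  # community assignment per node
```

Because within-clique edges vastly outnumber the single bridge, modularity is maximized by keeping the cliques as two communities rather than merging them.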
Clustering is an iterative process; the goal is not to pull apart as many clusters as possible in one pass. In the first pass you want to separate the main cell groups, such as epithelial cells and immune cells, so that in the next iteration you can refine the clusters and extract more granular cell types.
After clustering, we’ll review some cluster validation techniques to qualitatively and quantitatively check the quality of the clustering results.
se <- FindClusters(
object = se,
resolution = c(0.01, 0.05, 0.1, 0.15, 0.2,0.25),
  algorithm = 1)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 59572
Number of edges: 2044941
Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9958
Number of communities: 4
Elapsed time: 10 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 59572
Number of edges: 2044941
Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9853
Number of communities: 9
Elapsed time: 12 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 59572
Number of edges: 2044941
Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9765
Number of communities: 12
Elapsed time: 11 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 59572
Number of edges: 2044941
Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9708
Number of communities: 16
Elapsed time: 11 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 59572
Number of edges: 2044941
Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9661
Number of communities: 18
Elapsed time: 11 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 59572
Number of edges: 2044941
Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9621
Number of communities: 18
Elapsed time: 13 seconds
# Results seen at RNA_snn_res in metadata
colnames(se@meta.data)[str_detect(colnames(se@meta.data), "RNA_snn_res")]
[1] "RNA_snn_res.0.01" "RNA_snn_res.0.05" "RNA_snn_res.0.1"  "RNA_snn_res.0.15" "RNA_snn_res.0.2"  "RNA_snn_res.0.25"
KNN vs SNN
Subset Seurat object to visualize KNN and SNN
se_sub <- se[, sample(colnames(se), 1000)] # Subset to 1000 cells
se_sub <- se_sub %>%
FindNeighbors(
reduction = "pca",
dims = 1:30,
k.param = 10,
verbose = FALSE,
return.neighbor = FALSE,
compute.SNN = TRUE
)
Now let’s do some processing to visualize the KNN and SNN graphs
# Extract PCA
pca_coords <- se_sub@reductions$pca@cell.embeddings[, 1:3] %>%
as.data.frame() %>%
rownames_to_column("name")
## Edges for KNN graph
edges_knn <- data.frame(se_sub@graphs$RNA_nn, check.names = FALSE) %>%
tibble::rownames_to_column("source") %>%
pivot_longer(cols = -source, names_to = "target") %>%
dplyr::filter(source != target & value > 0) %>%
dplyr::select(-value)
## Edges for SNN graph
edges_snn <- data.frame(se_sub@graphs$RNA_snn, check.names = FALSE) %>%
tibble::rownames_to_column("source") %>%
pivot_longer(cols = -source, names_to = "target", values_to = "jaccard") %>%
# Prune SNN - for this use case we need a Jaccard index > 0.14 (1/7). This is
# a strict threshold for visualization purposes, to highlight the robustness
# to outliers
dplyr::filter(source != target & jaccard > 1 / 7) %>%
dplyr::select(-jaccard)
# Create graph objects
graph_knn <- tbl_graph(
edges = edges_knn,
nodes = pca_coords,
directed = FALSE) %>%
activate(nodes)
graph_snn <- tbl_graph(
edges = edges_snn,
nodes = pca_coords,
directed = FALSE) %>%
activate(nodes)
Visualize KNN and SNN graph
# Visualize KNN graph
pknn12 <- ggraph(graph_knn, layout = 'manual', x = PC_1, y = PC_2) +
geom_edge_link(alpha = 0.5) +
geom_point(aes(x = PC_1, y = PC_2), color = "blue", size = 3) +
theme_minimal() +
labs(title = "KNN in PCA Space", x = "PC_1", y = "PC_2")
# Visualize SNN graph
psnn12 <- ggraph(graph_snn, layout = 'manual', x = PC_1, y = PC_2) +
geom_edge_link(alpha = 0.5) +
geom_point(aes(x = PC_1, y = PC_2), color = "orange", size = 3) +
theme_minimal() +
labs(title = "SNN in PCA Space", x = "PC_1", y = "PC_2")
pknn12 | psnn12
# Also looking at PC1 vs PC3 for a different perspective
pknn13 <- ggraph(graph_knn, layout = 'manual', x = PC_1, y = PC_3) +
geom_edge_link(alpha = 0.5) +
geom_point(aes(x = PC_1, y = PC_3), color = "blue", size = 3) +
theme_minimal() +
labs(title = "KNN in PCA Space", x = "PC_1", y = "PC_3")
# Visualize SNN graph
psnn13 <- ggraph(graph_snn, layout = 'manual', x = PC_1, y = PC_3) +
geom_edge_link(alpha = 0.5) +
geom_point(aes(x = PC_1, y = PC_3), color = "orange", size = 3) +
theme_minimal() +
labs(title = "SNN in PCA Space", x = "PC_1", y = "PC_3")
pknn13 | psnn13
We can see that by looking at the SNN graphs we prune connections between extreme points and, therefore, improve the robustness of our clustering.
RunUMAP
UMAP runs on the PCA space. The dims parameter specifies which principal components to use for the UMAP calculation; here we use 1:30 to match the neighbor graph. The n.components parameter specifies the number of dimensions in the UMAP embedding, with a default value of 2.
se <- se %>%
  RunUMAP(dims = 1:30, verbose = FALSE) # Run UMAP
Visualize Clusters
DimPlot(
se,
group.by = c(
"RNA_snn_res.0.01", "RNA_snn_res.0.05",
"RNA_snn_res.0.1", "RNA_snn_res.0.15",
"RNA_snn_res.0.2", "RNA_snn_res.0.25"),
label = TRUE
)
Community detection doesn’t follow a hierarchical clustering process. The algorithm splits the graph into more communities at higher resolutions rather than subclustering the major clusters present at lower resolutions. Clustering at a low resolution splits the dataset into a certain number of groups, while clustering at a high resolution draws different boundaries on that same data. Below we show how community detection differs from hierarchical clustering.
se@meta.data %>%
dplyr::count(
Celltype, RNA_snn_res.0.01, RNA_snn_res.0.05,
RNA_snn_res.0.1, RNA_snn_res.0.25) %>%
ggplot(
aes(axis1 = RNA_snn_res.0.01, axis2 = RNA_snn_res.0.05,
axis3 = RNA_snn_res.0.1, axis4 = RNA_snn_res.0.25,
y = n)) +
geom_alluvium(aes(fill = RNA_snn_res.0.01), width = 0.1) +
geom_stratum(width = 0.1) +
geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
scale_x_discrete(
limits = c("RNA_snn_res.0.01", "RNA_snn_res.0.05",
"RNA_snn_res.0.1", "RNA_snn_res.0.25"),
expand = c(0.15, 0.05)) +
scale_fill_brewer(palette = "Dark2") +
labs(title = "Compare Cluster Resolution of 0.01, 0.05, 0.1 and 0.25",
x = "Clusters",
y = "Count") +
theme_minimal()
We see that in this case the clustering follows a fairly clean hierarchical splitting, though at higher resolutions some cells do branch out into other clusters.
Different Resolutions
0.05 vs 0.1 Resolution
DimPlot(
se,
group.by = c(
"Celltype","RNA_snn_res.0.05", "RNA_snn_res.0.1"),
label = TRUE
)
Comparing the 0.05 and 0.1 resolutions, Cluster 0 at resolution 0.05 appears to split into Clusters 1 and 2 at resolution 0.1. While this split could help sift through the data, we’ve decided to keep the large blobs together; this preserves visibility of the whole dataset for the second level of annotation. Resolution 0.05 has the best distribution, but let’s quantitatively analyze its cluster quality.
Cluster Metrics
Cluster Diversity
Splitting the clusters by cluster size and sample diversity allows us to make sure no single sample is being clustered separately from the others.
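Simpson's diversity index, as computed by vegan::diversity with index = "simpson", is 1 - sum(p_i^2), where p_i is the proportion of cells from sample i in the cluster: values near 1 mean many samples contribute evenly, values near 0 mean one sample dominates. A quick sketch with made-up counts:

```r
# Made-up sample composition of one cluster
counts <- c(sample1 = 50, sample2 = 30, sample3 = 20)

p <- counts / sum(counts)  # per-sample proportions
simpson <- 1 - sum(p^2)    # Simpson diversity: 1 - sum(p_i^2)
simpson  # 0.62

# A cluster dominated by one sample scores near 0
p_dom <- c(98, 1, 1) / 100
1 - sum(p_dom^2)  # 0.0394
```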
# seurat_clusters <- "RNA_snn_res.0.05"
diversityPerGroup <- function(df, species, group, diversity_metric = 'simpson') {
# Convert strings to symbols for curly-curly operator
species_sym <- rlang::sym(species)
group_sym <- rlang::sym(group)
# Count groups per species directly using curly-curly
tierNcount <- df %>%
group_by({{species_sym}}) %>%
count({{group_sym}}, name = "n") %>% ungroup
# Pivot table to allow vegan::diversity call
tierNwide <- tierNcount %>%
pivot_wider(names_from = {{group_sym}}, values_from = n, values_fill = list(n = 0))
# Use rownames of species for the diversity function, which needs a dataframe
tierNwide_df <- as.data.frame(tierNwide)
# species column must be first
tierNwide_df <- tierNwide_df %>% select({{species}}, everything())
rownames(tierNwide_df) <- tierNwide_df[, 1]
tierNwide_df <- tierNwide_df[, -1]
# Calculate diversity
diversity_values <- vegan::diversity(tierNwide_df, index = diversity_metric)
# Prepare result as a dataframe
result <- data.frame(
species = rownames(tierNwide_df),
diversity = diversity_values,
row.names = NULL
)
# Rename diversity column
names(result)[1] <- species
names(result)[2] <- sprintf('%s_diversity', group)
return(result)
}
# Calculate simpson's diversity per cluster
clusterMetrics <- diversityPerGroup(se@meta.data,
species = 'RNA_snn_res.0.05',
group = 'Sample ID',
diversity_metric = 'simpson')
# Calculate number of cells per cluster and join to metrics table
clusterMetrics <- clusterMetrics %>% left_join(se@meta.data %>% count(RNA_snn_res.0.05))
head(clusterMetrics)
  RNA_snn_res.0.05 Sample ID_diversity     n
1 0 0.913441363 23573
2 1 0.923354092 17543
3 2 0.922488819 6591
4 3 0.875371149 3756
5 4 0.007659878 2865
6 5 0.621039497 2688
Let’s visualize Simpson’s diversity index by cluster
ggplot(clusterMetrics, aes(x = RNA_snn_res.0.05, y = n)) +
geom_segment(aes(
x = RNA_snn_res.0.05, xend = RNA_snn_res.0.05,
y = 0, yend = n),
linewidth = 1.5, color = 'grey80') + # Draw lines for lollipops
geom_point(aes(color = `Sample ID_diversity`), size = 5) + # Add colored lollipop 'heads', coloring by 'Sample ID_diversity'
scale_y_log10() +
# scale_x_continuous(breaks = seq(0, 20)) +
scale_colour_viridis(
option = "C",
name = "Sample ID Diversity",
direction = 1,
limits = c(0, 1)) + # Colorblind-friendly, vibrant color palette
theme_minimal(base_size = 10) +
theme(legend.position = "bottom",
axis.text = element_text(size = 12),
axis.title = element_text(size = 14),
title = element_text(size = 16)) +
labs(x = "Clusters",
y = "log(cluster size)",
title = "Simpson Diversity Index per Cluster") +
  guides(colour = guide_colourbar(title.position = "top", title.hjust = 0.5))
Clusters 4 and 6 appear to have very low sample diversity. Let’s investigate which samples are in each cluster to see if there is any bias.
Plot sample distribution per cluster
# Prepare the data for plotting
plot_sample <- se@meta.data %>%
count(RNA_snn_res.0.05, `Sample ID`) %>%
group_by(RNA_snn_res.0.05) %>%
mutate(proportion = n / sum(n)) %>%
ungroup()
# Create a stacked bar plot
ggplot(
plot_sample,
aes(x = factor(RNA_snn_res.0.05),
y = proportion,
fill = `Sample ID`)) +
geom_bar(stat = "identity", position = "stack") +
labs(
title = "Distribution of Sample IDs Across Clusters",
x = "Seurat Clusters",
y = "Proportion",
fill = "Sample ID") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_manual(values = donor_pal)
Sample Flu 1 appears to make up the majority of Cluster 4, and Flu 3 the majority of Cluster 6. Looking at it with a table -
table(se$`Sample ID`, se$RNA_snn_res.0.05)
0 1 2 3 4 5 6 7 8
Flu 1 241 39 30 4 2854 1619 0 100 8
Flu 2 413 639 243 30 0 3 0 2 1
Flu 3 811 141 20 51 4 36 1533 5 0
Flu 4 395 442 157 20 0 19 5 1 1
Flu 5 287 76 1 5 0 13 1 269 0
nCoV 1 1035 1630 723 798 3 88 0 186 1
nCoV 10 834 1572 554 188 0 79 0 8 4
nCoV 11 1320 1971 630 453 3 41 0 2 5
nCoV 2 3513 747 479 175 0 68 0 10 7
nCoV 3 172 155 18 49 0 4 0 0 0
nCoV 4 132 372 163 24 0 20 2 1 0
nCoV 5 1962 1284 491 164 0 70 1 2 4
nCoV 6 2257 1116 594 299 0 257 0 0 3
nCoV 7 305 397 90 660 0 48 0 2 0
nCoV 8 377 439 184 510 0 79 0 284 0
nCoV 9 298 678 250 102 1 15 0 0 1
Normal 1 2852 615 809 18 0 32 0 2 3
Normal 2 2444 1701 393 40 0 45 0 2 21
Normal 3 1959 1868 410 134 0 69 0 31 19
Normal 4 1966 1661 352 32 0 83 0 11 18
Another way to visualize cluster diversity is through a tile plot instead of lollipops. This provides a clear way to compare these numerical metrics side by side.
ggplot(clusterMetrics,
aes(
x = 1,
y = as.character(RNA_snn_res.0.05),
fill = `Sample ID_diversity`,
label = n)) +
geom_tile(colour = "white") +
geom_text(nudge_x = 1.5, size = 3) +
geom_text(aes(label = signif(`Sample ID_diversity`, 2)),size = 3) +
scale_fill_distiller(palette = "Spectral", limits = c(0,1)) + theme_classic() +
coord_fixed(ratio = 0.25) +
expand_limits(x = c(0.5,3)) +
labs(
x = "Diversity Size",
y = "RNA_snn_res.0.05",
fill = "Simpson Diversity\nfor Sample") +
theme(axis.text.y = element_text(hjust = 1, vjust = 0.5, size = 12),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_text(size = 15),
legend.key.size = unit(1, 'cm'),
legend.title = element_text(size = 10),
legend.text = element_text(size = 10)
)
Visualizing cluster diversity by batch is also important to look out for. As with QC, these plots help explain your data and possible findings before claiming any biological phenomena.
# Prepare the data for plotting
se@meta.data %>%
count(RNA_snn_res.0.05, batch) %>%
group_by(RNA_snn_res.0.05) %>%
mutate(
proportion = n / sum(n),
batch = as.character(batch)) %>%
ungroup() %>%
# Create a stacked bar plot
ggplot(
aes(x = factor(RNA_snn_res.0.05),
y = proportion,
fill = batch)) +
geom_bar(stat = "identity", position = "stack") +
labs(x = "Seurat Clusters", y = "Proportion", fill = "Batch",
title = "Distribution of Batches Across Clusters") +
theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Silhouette Analysis
As covered in our workshop, silhouette analysis is a way to measure how similar each cell is to the other cells in its own cluster compared to ones in other clusters. The silhouette value ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters and vice versa.
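As a quick sanity check of the metric (a toy 1-D example, not our data), cluster::silhouette computes s(i) = (b_i - a_i) / max(a_i, b_i), where a_i is the mean distance from cell i to its own cluster and b_i the mean distance to the nearest other cluster:

```r
library(cluster)

# Two well-separated toy groups on a number line
x <- c(1.0, 1.2, 1.1, 9.0, 9.3, 9.1)
cl <- c(1, 1, 1, 2, 2, 2)

# Silhouette widths from the cluster labels and pairwise distances
sil <- silhouette(cl, dist(x))
summary(sil)$avg.width  # close to 1: tight, well-separated clusters
```

Moving the two groups closer together would drive the average width toward 0, and mislabeled points would go negative.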
seurat_clusters <- as.character(se@meta.data$RNA_snn_res.0.05)
pca_embeddings <- Embeddings(se, reduction = 'pca')
# Calculate silhouette widths
sil_scores <- silhouette(x = as.numeric(seurat_clusters), dist = dist(pca_embeddings))
# Extract silhouette scores
silhouette_data <- as.data.frame(sil_scores)
head(silhouette_data)
  cluster neighbor sil_width
1 0 2 0.4385896
2 0 2 0.3529505
3 0 5 0.3056284
4 4 6 0.2638074
5 0 2 0.2766471
6 2 0 0.4487648
# recover cell type names
row.names(silhouette_data) <- row.names(pca_embeddings)
silhouette_arranged <- silhouette_data %>%
mutate(cluster = as.character(cluster)) %>%
group_by(cluster) %>%
  arrange(-sil_width)
Visualize silhouette scores
overall_average <- silhouette_arranged %>%
ungroup %>%
summarize(ave = as.numeric(mean(sil_width))) %>%
pull(ave)
ggplot(
silhouette_arranged,
aes(x = sil_width,
y = cluster,
fill = cluster,
group = cluster)) +
geom_bar(stat = "identity", position = 'dodge2') +
geom_vline(xintercept = overall_average,
color = 'red',
linetype = 'longdash') +
theme_minimal() +
labs(title = "Silhouette Analysis",
y = "Cluster",
x = "Silhouette width",
fill = "Cluster") +
theme(axis.text.y = element_text(hjust = 1, vjust = 0.5, size = 20),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_text(size = 20),
        legend.position = "none"))
The dashed red line shows the average silhouette width across all clusters. Clusters 1 and 5 stick out here, and we can look at their distribution on a UMAP for more context.
So let’s look at silhouette scores on a UMAP
d5 <- DimPlot(se,
reduction = 'umap',
group.by = 'RNA_snn_res.0.05',
label = TRUE)
se$CellID <- rownames(se@meta.data)
sil_ids <- silhouette_data %>%
rownames_to_column('CellID') %>%
left_join(se@meta.data, by = 'CellID')
se <- AddMetaData(se, sil_ids)
FeaturePlot(
se,
  features = "sil_width") +
ggtitle('Silhouette width') +
scale_color_distiller(
type = "div",
  palette = "BrBG") | d5
Cluster 5 seems to have a particularly poor silhouette score, which potentially means these cells are not clustered correctly. The cluster could be under-clustered, with multiple cell types/states ending up within it. Let’s check whether there is heterogeneity within cluster 5; in the next notebook we will take a closer look at what might be going on.
Save Seurat object
saveRDS(se, file = "../data/clustered_se.rds")
Extra: More on Clustering Algorithms
The Louvain algorithm was developed in 2008 and is a popular community detection algorithm used in scRNA-seq. It recursively merges communities into single nodes and executes modularity clustering on the condensed graphs (Zhang). Both Seurat and scanpy use Louvain as the default clustering algorithm.
Leiden Algorithm
The Leiden algorithm was published in 2019 as an improvement of the Louvain algorithm. Leiden creates clusters by taking into account the number of links between cells in a cluster versus the overall expected number of links in the dataset. It computes a clustering on a kNN graph obtained from the PC-reduced expression space. It starts with an initial partition in which each node forms its own community. Next, the algorithm moves single nodes from one community to another to find a partition, which is then refined. Based on the refined partition an aggregate network is generated, which is again refined until no further improvements can be obtained and the final partition is reached (Single Cell Best Practices).
There are several popular clustering algorithms. There is no one way to cluster; clustering is a means of looking at the data from different angles. The most popular choices are the Louvain algorithm, the Leiden algorithm, hierarchical clustering, and k-means clustering. Seurat uses the Louvain algorithm by default; the Leiden algorithm, an improvement on Louvain, is available via algorithm = 4.
Session Info
sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.4
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggraph_2.1.0 tidygraph_1.3.1 ggalluvial_0.12.5 viridis_0.6.4 viridisLite_0.4.2 cluster_2.1.4 DoubletFinder_2.0.4 colorBlindness_0.1.9 RColorBrewer_1.1-3 lubridate_1.9.2 forcats_1.0.0 stringr_1.5.1 purrr_1.0.2 readr_2.1.4 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.0 tidyverse_2.0.0 Seurat_5.1.0 SeuratObject_5.0.2 sp_2.1-3 dplyr_1.1.4
loaded via a namespace (and not attached):
[1] jsonlite_1.8.8 magrittr_2.0.3 spatstat.utils_3.0-4 farver_2.1.1 rmarkdown_2.26 vctrs_0.6.5 ROCR_1.0-11 spatstat.explore_3.2-6 htmltools_0.5.7 gridGraphics_0.5-1 sctransform_0.4.1 parallelly_1.37.0 KernSmooth_2.23-22 htmlwidgets_1.6.4 ica_1.0-3 plyr_1.8.9 plotly_4.10.4 zoo_1.8-12 igraph_2.0.2 mime_0.12 lifecycle_1.0.4 pkgconfig_2.0.3 Matrix_1.6-5 R6_2.5.1 fastmap_1.1.1 fitdistrplus_1.1-11 future_1.33.1 shiny_1.8.0 digest_0.6.34 colorspace_2.1-0 patchwork_1.2.0 tensor_1.5 RSpectra_0.16-1 irlba_2.3.5.1 vegan_2.6-4 labeling_0.4.3 progressr_0.14.0 timechange_0.2.0 fansi_1.0.6 spatstat.sparse_3.0-3 mgcv_1.9-0 httr_1.4.7 polyclip_1.10-6 abind_1.4-5 compiler_4.3.1 withr_3.0.0 fastDummies_1.7.3 ggforce_0.4.1 MASS_7.3-60 permute_0.9-7 tools_4.3.1
[52] lmtest_0.9-40 httpuv_1.6.14 future.apply_1.11.1 goftest_1.2-3 glue_1.8.0 nlme_3.1-163 promises_1.2.1 grid_4.3.1 Rtsne_0.17 reshape2_1.4.4 generics_0.1.3 gtable_0.3.4 spatstat.data_3.0-4 tzdb_0.4.0 hms_1.1.3 data.table_1.15.4 utf8_1.2.4 spatstat.geom_3.2-8 RcppAnnoy_0.0.22 ggrepel_0.9.5 RANN_2.6.1 pillar_1.9.0 spam_2.10-0 RcppHNSW_0.6.0 later_1.3.2 splines_4.3.1 tweenr_2.0.2 lattice_0.21-8 survival_3.5-7 deldir_2.0-2 tidyselect_1.2.0 miniUI_0.1.1.1 pbapply_1.7-2 knitr_1.45 gridExtra_2.3 scattermore_1.2 xfun_0.42 graphlayouts_1.0.0 matrixStats_1.2.0 stringi_1.8.3 lazyeval_0.2.2 yaml_2.3.8 evaluate_0.23 codetools_0.2-19 cli_3.6.3 uwot_0.1.16 xtable_1.8-4 reticulate_1.35.0.9000 munsell_0.5.0 Rcpp_1.0.12 globals_0.16.2
[103] spatstat.random_3.2-2 png_0.1-8 parallel_4.3.1 ellipsis_0.3.2 dotCall64_1.1-1 listenv_0.9.1 scales_1.3.0 ggridges_0.5.6 leiden_0.4.3.1 rlang_1.1.3 cowplot_1.1.3